This describes the process of building a new monitoring server from scratch. It was modeled on the HamWAN server running in our Fremont site, but built on a VM of a newer vintage and with more modern software versions. Rather than Cacti, we use Zabbix as the monitoring engine.
Added Debian bookworm net-install ISO to proxmox server
Create a VM, and install Debian bookworm
Boot and configure as a minimal server using the graphical interface
apt install jq locate mg mtr postfix pylint python3-virtualenv redis-server redis-tools rsyslog strace sudo tcpdump
# update the locate database
updatedb
/sbin/groupadd hamadmin
add /etc/sudoers.d/hamadmin (copied from monitoring.hamwan.net)
Change /etc/ssh/sshd_config to move the SSH server to port 222, then restart sshd with systemctl restart sshd.
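The relevant change in /etc/ssh/sshd_config is just the Port directive:
# /etc/ssh/sshd_config
Port 222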
My initial account had to be lower case (so kd7dk). I then fixed that to my standard HamWAN KD7DK account.
Note: at this point I believe our ansible automation is capable of creating all the other netop accounts.
Plumb the management LAN to an interface.
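A minimal ifupdown stanza for this, assuming ens19 is the management-LAN interface and 10.44.4.8/24 is its address (both taken from the FRR config below):
# /etc/network/interfaces.d/mgmt (sketch)
auto ens19
iface ens19 inet static
    address 10.44.4.8/24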
Add routing for net 10.44.0.0/16 via the management-LAN router. We need this to reach both the public and management networks without routing between them, while keeping some redundancy that we would lose with a simple default route. The key changes are in /etc/frr/frr.conf and /etc/frr/daemons to support running two OSPF instances (a sketch of the daemons change follows the config below). The resulting frr.conf:
# default to using syslog. /etc/rsyslog.d/45-frr.conf places the log in
# /var/log/frr/frr.log
#
# Note:
# FRR's configuration shell, vtysh, dynamically edits the live, in-memory
# configuration while FRR is running. When instructed, vtysh will persist the
# live configuration to this file, overwriting its contents. If you want to
# avoid this, you can edit this file manually before starting FRR, or instruct
# vtysh to write configuration to a different file.
log syslog informational
frr defaults traditional
password <PASSWORD>
enable password <PASSWORD>
log file /var/log/frr/frr.log
interface ens18
 ip ospf 1 area 0
 ip ospf authentication message-digest
 ip ospf message-digest-key 1 md5 <OSPF_PASSWORD>
 ip ospf priority 10
interface ens19
 ip ospf 2 area 0
 ip ospf priority 10
interface lo
router ospf 1
 ospf router-id 44.25.67.58
 redistribute connected
 distribute-list AMPR out connected
 network 44.25.67.0/26 area 0
 network 44.25.0.0/23 area 0
 area 0 authentication message-digest
router ospf 2
 ospf router-id 10.44.4.8
 redistribute connected
 distribute-list MGMT out connected
 network 10.44.4.0/24 area 0
 network 10.44.200.0/23 area 0
access-list AMPR permit 44.0.0.0/9
access-list AMPR permit 44.128.0.0/10
access-list MGMT permit 10.44.0.0/16
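The daemons change is along these lines, assuming FRR's multi-instance OSPF support is used for the two instances (a sketch; everything else stays at its defaults):
# /etc/frr/daemons
ospfd=yes
ospfd_instances="1,2"
# then restart FRR
systemctl restart frr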
Install git.
apt install git
Install Docker according to https://docs.docker.com/engine/install/debian/. Then set up the Zabbix containers from https://github.com/zabbix/zabbix-docker/tree/7.2, following the model docker-compose_v3_ubuntu_mysql_latest.yaml.
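A condensed sketch of the apt-repository install from the linked Docker documentation (check the docs for the current steps):
apt install ca-certificates curl
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian $(. /etc/os-release && echo $VERSION_CODENAME) stable" > /etc/apt/sources.list.d/docker.list
apt update
apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin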
Add the following to /etc/fstab:
/dev/mapper/monitoring--data--vg-data /data ext4 defaults 0 2
then run systemctl daemon-reload.
If necessary, mount /data.
# install mysql (via mariadb fork)
apt install mariadb-server
systemctl stop mariadb
mkdir /data/mysql
chown mysql:mysql /data/mysql
Edit /etc/mysql/mariadb.conf.d/50-server.cnf:
datadir = /data/mysql
innodb_buffer_pool_size = 8G
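If the new datadir starts out empty, it needs to be initialized before MariaDB is started again (a sketch; alternatively copy the existing /var/lib/mysql contents to /data/mysql instead):
mariadb-install-db --user=mysql --datadir=/data/mysql
systemctl start mariadb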
Add an empty database to mysql and grant access to 'zabbix' with a password.
mysql << DONE
create database zabbix;
grant all privileges on *.* to 'zabbix'@'localhost' identified by 'some-password';
DONE
Create a user to run zabbix containers and install the agent:
useradd -m -c "Zabbix Monitoring" zabbix
# in the container zabbix-server
chmod u+s /usr/bin/fping
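One way to apply this from the host is with docker exec (the container name here is a placeholder):
docker exec -u root _zabbix_server_container_ chmod u+s /usr/bin/fping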
# Install zabbix-agent Debian package (here and on other Linux servers)
apt install zabbix-agent
# then configure /etc/zabbix/zabbix_agentd.conf
Server=172.16.241.0/24
ServerActive=172.16.241.3:10051
Hostname=monitoring.ziply.hamwan.net
This uses the zabbix container addresses.
On any other repeater, the changes would look like:
Server=44.25.67.58
ServerActive=44.25.67.58:10051
Hostname=monitoring.ziply.hamwan.net
Server can be an address, CIDR range, or list of either. See the manual for more details.
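For example, to accept both the monitoring host and the container network (illustrative values taken from this page):
Server=44.25.67.58,172.16.241.0/24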
Disable unnecessary discovery in Mikrotik Template (CapsMAN, LTE)
# Start the Zabbix stack
cd ~zabbix/zabbix-docker
docker compose -f psdr.yaml --profile all up -d
# Stop the Zabbix stack
cd ~zabbix/zabbix-docker
docker compose -f psdr.yaml --profile all down
# Follow the logs of a single container
docker logs -f _container_id_or_name_
If you want to change one of the Zabbix parameters specified in the config files under /home/zabbix/zabbix-docker (or a subdirectory such as env_vars), you need to stop Zabbix and rebuild the affected containers.
cd ~zabbix/zabbix-docker
docker compose -f psdr.yaml --profile all down
docker compose -f psdr.yaml --profile all build --no-cache --parallel
docker compose -f psdr.yaml --profile all up -d
'all' can be replaced with a specific container name if you know the impact of the update is limited, but the safest option is to use all.
Updated the Mikrotik by SNMP template to enhance dashboards and turn off some data collection (LTE, CAPsMAN).
Added a template for AirFiber by SNMP, and then created a variant with the PING and inventory update disabled/removed so that it can be combined with Network Generic Device by SNMP, which provides important additional Triggers and Graphs not available in the AirFiber template as provided.
Upgrades come in 2 pieces
Basic steps:
For a major upgrade:
Used the configuration from Fremont largely unchanged. There is a key dependency on the Unfiltered.log file, where the bulk of HamWAN infrastructure logs; hacking attempts are fed into fail2ban from there. The configuration of client logging needs review; it had lots of hardcoded 44.24 addresses that were never updated. We need to consider what kind of separation we actually need.
Basically out of the box. Will probably need to update to add authentication.
Used new fail2ban.conf as starting point and merged in HamWAN changes. Added new HamWAN files from Fremont with review.
These provide a redis server and the long poll web servers for new bans. https://github.com/kd7lxl/blacklist-service
You also need to add this to 000-default-ssl.conf for Apache:
<Location "/blacklist">
    ProxyPass "http://127.0.0.1:1234/"
</Location>
This also needs the proxy and proxy_http modules enabled:
a2enmod proxy
a2enmod proxy_http
Add /srv/www/keys and this stanza to 000-default-ssl.conf for apache:
Use the scripts in the HamWAN GitHub repository.
ICMP ping loss trigger is too sensitive. Needs a longer sample interval I believe.
redis is complaining at startup: WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
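The fix is exactly what the warning suggests:
# apply immediately
sysctl vm.overcommit_memory=1
# persist across reboots
echo 'vm.overcommit_memory = 1' >> /etc/sysctl.conf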
S2.CapitolPark.hamwan.net MikroTik: Interface wlan1(): Link down. Caused because a sector that loses its last client connection will change OperStatus to down instead of dormant. This trigger may need to be revised for APs, or possibly deleted for APs.
zabbix-server-1 | 69:20250205:143203.601 cannot send list of active checks to "172.16.241.1": host [monitoring] not found. This is a configuration issue, I believe.
Figure out how to ensure all the proxmox servers have the same inventory of install images.
Web GUI shows a valid certificate but complains about active content with certificate errors. Something like this posting: https://community.letsencrypt.org/t/chrome-69-0-3497-81-reports-active-content-with-certificate-errors/71545 Resolution: This was caused by cached javascript content from the prior self-signed certificate. Cleared the cached content and all was fine.
Ping health checks failing. Needed to add setuid bit to fping in the server container.
SNMP agent item “net.if.wireless.walk” on host “r1.capitolpark.hamwan.net” failed: first network error, wait for 15 seconds (and similar) Possibly: https://www.zabbix.com/forum/zabbix-troubleshooting-and-problems/483095-zabbix-7-snmp-timeouts Fixed this by increasing the timeout for SNMP queries (Administration > General > Timeouts). I changed it from 3s to 10s.
3 Mikrotik hosts are refusing to respond to SNMP get/getbulk; in this case s2.indianola, r3.baldi and capitolpark.queenanne. This turned out to be a semi-known issue with SNMP, Mikrotik, and asymmetric routes. RouterOS would respond from the address of whichever interface had the best route back to the requester. If that was not the interface the request was sent to, the requester would be unable to match the response with the request it sent. The solution is to give all devices a stable address (on their loopback interface if they have more than one interface), make that the default address in the portal, and set src-address in /snmp to force all SNMP responses to come from that address.
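On each affected RouterOS device that is something like the following (the address shown is a placeholder for the device's stable/loopback address):
/snmp set src-address=<loopback-address>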
Mikrotik templates (from Zabbix and third parties) https://www.zabbix.com/integrations/mikrotik
https://www.zabbix.com/documentation/current/en/manual/discovery/network_discovery